
Conversation

AntonioMacaronio (Contributor) commented Mar 2, 2025

This PR is a follow-up to #3216. Its purpose is to add further documentation, fix bugs, and improve clarity.

Problems and Background

  • One main issue is that on some CUDA versions and builds (such as WSL with CUDA 11.8), GPU tensor serialization is unreliable. When a worker process puts tensors on the GPU (via .to(self.device)), those GPU tensors cannot be properly serialized back to the main process: PyTorch attempts to serialize them but fails silently, resulting in zeroed tensors.
  • depth-nerfacto broken #3592: updated the method_configs.py file to resolve this.
  • AssertionError when using ns-export #3586: ns-export cameras is also resolved by the GPU fixes.

Overview of Changes

  • Worker processes no longer move PyTorch tensors to the GPU. Workers keep tensors on the CPU until they are transferred to the main process, which moves them to the GPU for the forward/backward passes. This applies to both NeRF methods and 3DGS methods (see the sketch after this list).
  • New flowchart guides for training on large datasets!
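
A minimal, self-contained sketch of the new pattern (the toy dataset and names are illustrative, not the actual nerfstudio code): the DataLoader workers return CPU tensors, and only the main process moves them to the GPU.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # Correct: build and return CPU tensors inside the worker process.
        # Returning CUDA tensors from a worker (e.g. torch.rand(...).to("cuda"))
        # can fail silently (zeroed tensors) or raise CUDA context errors.
        return torch.rand(3, 64, 64)

if __name__ == "__main__":
    loader = DataLoader(ToyDataset(), batch_size=4, num_workers=2, pin_memory=True)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    for batch in loader:
        # The GPU transfer happens here, in the main process.
        batch = batch.to(device, non_blocking=True)
```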

TODOs

  • flowcharts

abrahamezzeddine commented Mar 3, 2025

Hello,

I am getting this bug when using load from disk. It is intermittent and works again after re-launching training.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Abraham\.conda\envs\nerfstudio\Scripts\ns-train.exe\__main__.py", line 7, in <module>
  File "O:\Gaussian\nerfstudio\nerfstudio\scripts\train.py", line 272, in entrypoint
    main(
  File "O:\Gaussian\nerfstudio\nerfstudio\scripts\train.py", line 257, in main
    launch(
  File "O:\Gaussian\nerfstudio\nerfstudio\scripts\train.py", line 190, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "O:\Gaussian\nerfstudio\nerfstudio\scripts\train.py", line 101, in train_loop
    trainer.train()
  File "O:\Gaussian\nerfstudio\nerfstudio\engine\trainer.py", line 266, in train
    loss, loss_dict, metrics_dict = self.train_iteration(step)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "O:\Gaussian\nerfstudio\nerfstudio\utils\profiler.py", line 111, in inner
    out = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "O:\Gaussian\nerfstudio\nerfstudio\engine\trainer.py", line 502, in train_iteration
    _, loss_dict, metrics_dict = self.pipeline.get_train_loss_dict(step=step)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "O:\Gaussian\nerfstudio\nerfstudio\utils\profiler.py", line 111, in inner
    out = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "O:\Gaussian\nerfstudio\nerfstudio\pipelines\base_pipeline.py", line 298, in get_train_loss_dict
    ray_bundle, batch = self.datamanager.next_train(step)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "O:\Gaussian\nerfstudio\nerfstudio\data\datamanagers\full_images_datamanager.py", line 390, in next_train
    camera, data = next(self.iter_train_image_dataloader)[0]
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Abraham\.conda\envs\nerfstudio\Lib\site-packages\torch\utils\data\dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "C:\Users\Abraham\.conda\envs\nerfstudio\Lib\site-packages\torch\utils\data\dataloader.py", line 1344, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Abraham\.conda\envs\nerfstudio\Lib\site-packages\torch\utils\data\dataloader.py", line 1370, in _process_data
    data.reraise()
  File "C:\Users\Abraham\.conda\envs\nerfstudio\Lib\site-packages\torch\_utils.py", line 706, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "C:\Users\Abraham\.conda\envs\nerfstudio\Lib\site-packages\torch\utils\data\_utils\worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Abraham\.conda\envs\nerfstudio\Lib\site-packages\torch\utils\data\_utils\fetch.py", line 33, in fetch
    data.append(next(self.dataset_iter))
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "O:\Gaussian\nerfstudio\nerfstudio\data\utils\dataloaders.py", line 651, in __iter__
    data[k] = data[k].to(self.device)
              ^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

AntonioMacaronio marked this pull request as ready for review April 17, 2025 07:29

akristoffersen (Contributor) left a comment:

LGTM! I don't immediately see what would fix the cam_idx metadata from the cameras to show up here, but if you've validated it on your side, it's good with me!

Here, the variable 'batch' refers to the output of our pixel sampler (a small illustrative sketch follows below).
- batch is a dict with keys ['image', 'indices']
- batch['image'] returns a `torch.Size([4096, 3])` tensor on CPU, where 4096 = num_rays_per_batch.
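
For illustration, here is a tiny self-contained sketch of a batch shaped like the one described above (the (camera_idx, row, col) layout of 'indices' and the toy image stack are assumptions, not the exact sampler code):

```python
import torch

num_rays_per_batch = 4096
images = torch.rand(10, 120, 160, 3)  # toy stack of 10 RGB images, kept on the CPU

# Sample one pixel per ray: each row of `indices` is (camera_idx, row, col).
indices = torch.stack(
    [torch.randint(0, n, (num_rays_per_batch,)) for n in images.shape[:3]], dim=-1
)
batch = {
    "image": images[indices[:, 0], indices[:, 1], indices[:, 2]],  # (4096, 3) on CPU
    "indices": indices,                                            # (4096, 3)
}
assert batch["image"].shape == (num_rays_per_batch, 3)
```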

Contributor:

do we support rgba supervision here?

AntonioMacaronio (Contributor Author):

Yes, it does! When an RGBA image is present in the dataset, it gets converted into RGB format in the InputDataset.

Specifically, this is what happens (a hedged sketch of the compositing step follows the list):

  1. dataloaders.py's RayBatchStream will call self.input_dataset.__getitem__
  2. InputDataset's __getitem__() method calls self.get_data
  3. get_data will call get_image_float32
  4. get_image_float32 has the code for RGBA support
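
As a rough illustration of step 4, here is a minimal, self-contained sketch of RGBA-to-RGB compositing. The function name, default background color, and exact behavior are assumptions for illustration, not the actual get_image_float32 implementation.

```python
import torch

def rgba_to_rgb(image: torch.Tensor, background: float = 1.0) -> torch.Tensor:
    """Composite an (H, W, 4) float image in [0, 1] over a solid background color."""
    if image.shape[-1] == 4:
        rgb, alpha = image[..., :3], image[..., 3:4]
        # Alpha-blend the color channels over the background so downstream code
        # only ever sees 3-channel images.
        return rgb * alpha + background * (1.0 - alpha)
    return image  # already RGB, nothing to do
```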

AntonioMacaronio (Contributor Author) commented Apr 20, 2025

@akristoffersen It's a little obscure how the camera metadata gets fixed, but when a worker process sends a PyTorch tensor or a tensor dataclass object (like Cameras) to the main process, that tensor or tensor dataclass object has to be on the CPU device. If it is on the GPU, you get CUDA context errors and/or other unpredictable behavior like what @abrahamezzeddine ran into. A minimal sketch of the rule follows the links below.

Here are some links talking about this:
https://discuss.pytorch.org/t/cuda-initialization-error-when-dataloader-with-cuda-tensor/43390/9
https://discuss.pytorch.org/t/dataloader-multiprocessing-with-dataset-returning-a-cuda-tensor/151022
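
To make the rule concrete, here is a small, self-contained sketch. ToyCameras is a hypothetical stand-in for a tensor dataclass like Cameras (not nerfstudio's actual class): everything that crosses the worker-to-main-process boundary stays on the CPU, and only the main process moves it to the GPU.

```python
import dataclasses
import torch

@dataclasses.dataclass
class ToyCameras:
    """Hypothetical stand-in for a tensor dataclass such as Cameras."""
    camera_to_worlds: torch.Tensor  # (N, 3, 4) camera poses
    fx: torch.Tensor                # (N, 1) focal lengths

    def to(self, device) -> "ToyCameras":
        return ToyCameras(self.camera_to_worlds.to(device), self.fx.to(device))

# Worker side: everything is built on (and stays on) the CPU.
cameras = ToyCameras(
    camera_to_worlds=torch.eye(4)[:3].repeat(2, 1, 1),
    fx=torch.full((2, 1), 500.0),
)

# Main-process side: move to the GPU only after the object has crossed the
# process boundary, right before the forward/backward pass needs it.
device = "cuda" if torch.cuda.is_available() else "cpu"
cameras = cameras.to(device)
```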

AntonioMacaronio merged commit 94357f8 into nerfstudio-project:main Apr 20, 2025
3 checks passed